70 research outputs found
A robust approach to model-based classification based on trimming and constraints
In a standard classification framework, a set of trustworthy learning data is
employed to build a decision rule, with the final aim of classifying unlabelled
units belonging to the test set. Therefore, unreliable labelled observations,
namely outliers and data with incorrect labels, can strongly undermine the
classifier performance, especially if the training size is small. The present
work introduces a robust modification to the Model-Based Classification
framework, employing impartial trimming and constraints on the ratio between
the maximum and the minimum eigenvalue of the group scatter matrices. The
proposed method effectively handles noise in both the response and the
explanatory variables, providing reliable classification even when dealing with
contaminated datasets. A robust information criterion is proposed for model
selection. Experiments on real and simulated data, artificially adulterated,
are provided to underline the benefits of the proposed method.
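The two robustness devices named in the abstract above, impartial trimming and a bound on the eigenvalue ratio of the group scatter matrices, can be illustrated with a minimal sketch. It is not the authors' estimator: the function names, the `max_ratio` and `alpha` parameters, and the simple "raise the small eigenvalues" truncation rule are assumptions made for illustration only.

```python
import numpy as np

def constrain_eigenvalue_ratio(scatter_matrices, max_ratio):
    """Clamp eigenvalues so that largest / smallest (over all groups) <= max_ratio.

    Assumption: a naive truncation that lifts small eigenvalues to
    largest / max_ratio; published methods choose the threshold optimally.
    """
    decomps = [np.linalg.eigh(S) for S in scatter_matrices]
    largest = max(vals.max() for vals, _ in decomps)
    floor = largest / max_ratio
    constrained = []
    for vals, vecs in decomps:
        clipped = np.clip(vals, floor, None)  # raise eigenvalues below the floor
        constrained.append(vecs @ np.diag(clipped) @ vecs.T)
    return constrained

def impartial_trimming_mask(densities, alpha):
    """Keep the ceil(n * (1 - alpha)) observations with the highest fitted density."""
    densities = np.asarray(densities)
    n = len(densities)
    n_keep = int(np.ceil(n * (1 - alpha)))
    order = np.argsort(densities)[::-1]  # indices sorted by decreasing density
    keep = np.zeros(n, dtype=bool)
    keep[order[:n_keep]] = True
    return keep

# Hypothetical inputs: two group scatter matrices and four density values.
groups = [np.diag([100.0, 1.0]), np.diag([4.0, 0.01])]
bounded = constrain_eigenvalue_ratio(groups, max_ratio=10.0)
mask = impartial_trimming_mask([0.5, 0.01, 0.3, 0.2], alpha=0.25)
```

In a robust fitting loop, the observations flagged `False` in `mask` would be excluded from the parameter update, and the constrained matrices would replace the raw scatter estimates.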
Group-Wise Shrinkage Estimation in Penalized Model-Based Clustering
Finite Gaussian mixture models provide a powerful and widely employed probabilistic approach for clustering multivariate continuous data. However, the practical usefulness of these models is jeopardized in high-dimensional spaces, where they tend to be over-parameterized. As a consequence, different solutions have been proposed, often relying on matrix decompositions or variable selection strategies. Recently, a methodological link between Gaussian graphical models and finite mixtures has been established, paving the way for penalized model-based clustering in the presence of large precision matrices. Notwithstanding, current methodologies implicitly assume similar levels of sparsity across the classes, not accounting for different degrees of association between the variables across groups. We overcome this limitation by deriving group-wise penalty factors, which automatically enforce under- or over-connectivity in the estimated graphs. The approach is entirely data-driven and does not require additional hyper-parameter specification. Analyses on synthetic and real data showcase the validity of our proposal.
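To make the idea of group-wise penalty factors concrete, the following sketch evaluates a Gaussian mixture log-likelihood with a lasso-type penalty on each component's precision matrix, scaled by a per-group factor. The abstract does not say how the factors are derived, so `penalty_factors` is a placeholder input here; penalizing only the off-diagonal entries is likewise an assumption.

```python
import numpy as np

def penalized_loglik(X, weights, means, precisions, lam, penalty_factors):
    """Mixture log-likelihood minus lam * sum_k w_k * ||offdiag(Omega_k)||_1.

    Assumptions: w_k (penalty_factors) are supplied by the caller, and only
    off-diagonal precision entries are penalized.
    """
    n, p = X.shape
    log_terms = []
    for pi_k, mu_k, omega_k in zip(weights, means, precisions):
        diff = X - mu_k
        _, logdet = np.linalg.slogdet(omega_k)
        quad = np.einsum("ij,jk,ik->i", diff, omega_k, diff)  # Mahalanobis terms
        log_terms.append(
            np.log(pi_k) + 0.5 * logdet - 0.5 * quad - 0.5 * p * np.log(2 * np.pi)
        )
    log_terms = np.stack(log_terms, axis=1)  # shape (n, K)
    row_max = log_terms.max(axis=1, keepdims=True)
    # Numerically stable log-sum-exp over components, summed over observations.
    loglik = np.sum(row_max[:, 0] + np.log(np.exp(log_terms - row_max).sum(axis=1)))
    penalty = 0.0
    for w_k, omega_k in zip(penalty_factors, precisions):
        off_diag = omega_k - np.diag(np.diag(omega_k))
        penalty += w_k * np.abs(off_diag).sum()
    return loglik - lam * penalty

# Hypothetical single-component example with two observations at the origin.
X_demo = np.zeros((2, 2))
omega_demo = np.array([[1.0, 0.5], [0.5, 1.0]])
unpenalized = penalized_loglik(X_demo, [1.0], [np.zeros(2)], [omega_demo], 0.0, [2.0])
penalized = penalized_loglik(X_demo, [1.0], [np.zeros(2)], [omega_demo], 1.0, [2.0])
```

A larger factor for a group shrinks more of its off-diagonal entries to zero and yields a sparser graph, while a smaller factor permits denser connectivity.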
Sparse model-based clustering of three-way data via lasso-type penalties
Mixtures of matrix Gaussian distributions provide a probabilistic framework
for clustering continuous matrix-variate data, which are becoming increasingly
prevalent in various fields. Despite its widespread adoption and successful
application, this approach suffers from over-parameterization issues, making it
less suitable even for matrix-variate data of moderate size. To overcome this
drawback, we introduce a sparse model-based clustering approach for three-way
data. Our approach assumes that the matrix mixture parameters are sparse and
have different degrees of sparsity across clusters, allowing parsimony to be induced
in a flexible manner. Estimation of the model relies on the maximization of a
penalized likelihood, with specifically tailored group and graphical lasso
penalties. These penalties enable the selection of the most informative
features for clustering three-way data where variables are recorded over
multiple occasions and make it possible to capture cluster-specific association
structures. The proposed methodology is tested extensively on synthetic data
and its validity is demonstrated in application to time-dependent crime
patterns in different US cities.
A general framework for penalized mixed-effects multitask learning with applications on DNA methylation surrogate biomarkers creation
Recent evidence highlights the usefulness of DNA methylation (DNAm)
biomarkers as surrogates for exposure to risk factors for noncommunicable
diseases in epidemiological studies and randomized trials. DNAm variability
has been demonstrated to be tightly related to lifestyle behavior and exposure
to environmental risk factors, ultimately providing an unbiased proxy of
an individual's state of health. At present, the creation of DNAm surrogates
relies on univariate penalized regression models, with the elastic-net regularizer
being the gold standard for the task. Nonetheless, more advanced
modeling procedures are required in the presence of multivariate outcomes
with a structured dependence pattern among the study samples. In this
work we propose a general framework for mixed-effects multitask learning
in the presence of high-dimensional predictors to develop a multivariate DNAm
biomarker from a multicenter study. A penalized estimation scheme, based
on an expectation-maximization algorithm, is devised in which any penalty
criterion for fixed-effects models can be conveniently incorporated in the fitting
process. We apply the proposed methodology to create novel DNAm
surrogate biomarkers for multiple correlated risk factors for cardiovascular
diseases and comorbidities. We show that the proposed approach, modeling
multiple outcomes together, outperforms state-of-the-art alternatives both in
predictive power and biomolecular interpretation of the results.
Penalized model-based clustering for three-way data structures
Recently, there has been an increasing interest in developing statistical
methods able to find groups in matrix-valued data. To this end, matrix Gaussian
mixture models (MGMM) provide a natural extension to the popular model-based
clustering based on Normal mixtures. Unfortunately, the overparametrization issue,
already affecting the vector-variate framework, is further exacerbated when it comes
to MGMM, since the number of parameters scales quadratically with both row and
column dimensions. In order to overcome this limitation, the present paper introduces
a sparse model-based clustering approach for three-way data structures. By
means of penalized estimation, our methodology shrinks the estimates towards zero,
achieving more stable and parsimonious clustering in high dimensional scenarios.
An application to satellite images underlines the benefits of the proposed method.
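The quadratic growth in the number of parameters mentioned above can be checked with a short count. This is a sketch under stated assumptions: unconstrained row and column covariance matrices per component, and one parameter subtracted per component because the two covariance matrices are identified only up to a scalar. Other counting conventions omit that subtraction. The function name is hypothetical.

```python
def mgmm_free_parameters(n_components, n_rows, n_cols):
    """Count free parameters of a matrix Gaussian mixture on n_rows x n_cols data."""
    mean_params = n_rows * n_cols                # one mean matrix per component
    row_cov = n_rows * (n_rows + 1) // 2         # symmetric row covariance
    col_cov = n_cols * (n_cols + 1) // 2         # symmetric column covariance
    per_component = mean_params + row_cov + col_cov - 1  # assumed scale constraint
    return (n_components - 1) + n_components * per_component

# Three components on 10 x 10 matrices already require several hundred parameters.
small = mgmm_free_parameters(2, 3, 4)
moderate = mgmm_free_parameters(3, 10, 10)
```

Doubling either dimension roughly quadruples the corresponding covariance term, which is why sparsity-inducing penalties become attractive even for matrices of moderate size.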